9/20/2018

University of Arkansas Statistics Seminar

Introduction

  • Climate change is well understood globally.
  • Climate change is less well understood locally.
  • Need for spatially explicit reconstructions of climate variables.
  • Problem: data sources are messy and noisy.

Introduction

Predicting the future by learning from the past

Learning about the past

Climate proxy data

  • Many ecological and physical processes respond to climate over different time scales.
    • Tree rings, corals, forest landscapes, ice cores, lake levels, etc.


  • These processes are called climate proxies.
    • They are proxy measurements for unobserved climate.
    • Noisy and messy.
    • Respond to a wide variety of non-climatic signals.

Pollen Data

Fossil Pollen Data

Model Framework

  • Bayesian hierarchical model.

\(\begin{align*} \color{cyan}{[\mathbf{Z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}]} & \propto \color{red}{[\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{Z}]} \color{blue}{[\mathbf{Z} | \boldsymbol{\theta}_P]} \color{orange}{[\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P]} \end{align*}\)

  • Posterior (cyan).

  • Data Model (red).

  • Process Model (blue).

  • Prior Model (orange).

Data Model

\(\begin{align*} [\mathbf{Z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}] & \propto \color{red}{[\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{Z}]} [\mathbf{Z} | \boldsymbol{\theta}_P] [\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P] \end{align*}\)

  • Describes how the data are collected and observed.
  • Researchers take sediment samples from a lake.
  • Take 1 cm\(^3\) cubes along the length of the sediment core.
  • In each cube, a researcher counts the first \(N\) pollen grains and identifies each grain to species.
  • Raw data are counts of each species.

Data Model

For location \(\mathbf{s}_i\) and time \(t\),

\(\begin{align*} \mathbf{y} \left( \mathbf{s}_i, t \right) & = \left( y_{1} \left( \mathbf{s}_i, t \right), \ldots, y_{d} \left( \mathbf{s}_i, t \right) \right)' \end{align*}\)

is an observation of a \(d\)-dimensional compositional count.

  • \(y_{j} \left( \mathbf{s}_i, t \right)\) is the count of species \(j\) in the sample at location \(\mathbf{s}_i\) and time \(t\).
  • Compositional count data.
    • Total count is not informative of the absolute composition.
    • Informative of the relative proportions \(p_{j} \left( \mathbf{s}_i, t \right)\) only.

Data Model

  • Compositional count vector \(\mathbf{y} \left( \mathbf{s}_i, t \right)\) is a function of latent proportions \(\mathbf{p}\left( \mathbf{s}_i, t \right)\).


\(\begin{align*} \mathbf{y}\left( \mathbf{s}_i, t \right) | \mathbf{p}\left( \mathbf{s}_i, t \right) & \sim \operatorname{Multinomial} \left( N\left( \mathbf{s}_i, t \right), \mathbf{p}\left( \mathbf{s}_i, t \right) \right) \end{align*}\)


  • \(N\left( \mathbf{s}_i, t \right) = \sum_{j=1}^d y_{j}\left( \mathbf{s}_i, t \right)\) is the total count observed (fixed and known) for observation at location \(\mathbf{s}_i\) and time \(t\).


Overdispersion

  • The pollen data are highly variable and overdispersed.


\(\begin{align*} \mathbf{p}\left( \mathbf{s}_i, t \right) | \boldsymbol{\alpha}\left( \mathbf{s}_i, t \right) & \sim \operatorname{Dirichlet} \left( \boldsymbol{\alpha}\left( \mathbf{s}_i, t \right) \right) \end{align*}\)


  • Marginalize out \(\mathbf{p} \left( \mathbf{s}_i, t \right)\) to get Dirichlet-multinomial.


\(\begin{align*} \mathbf{y}\left( \mathbf{s}_i, t \right) | \boldsymbol{\alpha}\left( \mathbf{s}_i, t \right) & \sim \operatorname{Dirichlet-Multinomial} \left( N\left( \mathbf{s}_i, t \right), \boldsymbol{\alpha}\left( \mathbf{s}_i, t \right) \right) \end{align*}\)
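Integrating the multinomial over the Dirichlet gives a probability mass function that can be written with gamma functions. The sketch below is a minimal illustration using only the Python standard library. The function names and the example counts and parameters are assumptions made for illustration, not values from the talk.

```python
import math

def dirichlet_multinomial_logpmf(y, alpha):
    """Log-probability of counts y under a Dirichlet-multinomial with parameters alpha."""
    n = sum(y)
    a = sum(alpha)
    # Multinomial coefficient: N! / prod(y_j!)
    out = math.lgamma(n + 1) - sum(math.lgamma(yj + 1) for yj in y)
    # Gamma-function terms left after integrating out the proportions p
    out += math.lgamma(a) - math.lgamma(n + a)
    out += sum(math.lgamma(yj + aj) - math.lgamma(aj) for yj, aj in zip(y, alpha))
    return out

def variance_inflation(n, alpha):
    """Ratio of Dirichlet-multinomial count variance to multinomial count variance."""
    a = sum(alpha)
    return (n + a) / (1.0 + a)

y = [12, 5, 3]            # hypothetical pollen counts for d = 3 taxa
alpha = [2.0, 1.0, 0.5]   # hypothetical Dirichlet parameters
log_prob = dirichlet_multinomial_logpmf(y, alpha)
inflation = variance_inflation(sum(y), alpha)
```

The inflation factor \((N + \sum_j \alpha_j) / (1 + \sum_j \alpha_j)\) is always at least 1 and approaches 1 as \(\sum_j \alpha_j\) grows, which is how the Dirichlet layer accommodates overdispersion.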


Overdispersion

  • We model the Dirichlet-multinomial count data using the log link function:


\(\begin{align*} \operatorname{log} \left( \boldsymbol{\alpha} \left( \mathbf{s}_i, t \right) \right) & = \mathbf{z}\left( \mathbf{s}_i, t \right) \boldsymbol{\beta}. \end{align*}\)


  • \(\mathbf{z}\left( \mathbf{s}_i, t \right)\) is a \(q\)-dimensional vector of covariates.


  • \(\boldsymbol{\beta}\) is a \(q \times d\) dimensional matrix of regression coefficients.
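Under the log link, the expected proportions \(\alpha_j / \sum_k \alpha_k\) are a softmax of the linear predictor, while \(\sum_k \alpha_k\) controls how tightly the counts concentrate around them. A minimal sketch, with hypothetical function names and made-up values for \(\mathbf{z}\) and \(\boldsymbol{\beta}\):

```python
import math

def alpha_from_covariates(z, beta):
    """alpha_j = exp(sum_k z_k * beta[k][j]) for a q-vector z and a q x d matrix beta."""
    d = len(beta[0])
    return [math.exp(sum(zk * bk[j] for zk, bk in zip(z, beta))) for j in range(d)]

def mean_proportions(alpha):
    """Expected composition implied by the Dirichlet parameters."""
    total = sum(alpha)
    return [a / total for a in alpha]

z = [1.0, 0.5]                 # hypothetical: intercept plus one climate covariate (q = 2)
beta = [[0.0, 1.0, -1.0],      # hypothetical q x d coefficient matrix (d = 3 taxa)
        [2.0, 0.0, 0.5]]
alpha = alpha_from_covariates(z, beta)
p_bar = mean_proportions(alpha)
```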

Calibration

  • The covariates \(\mathbf{z} \left( \mathbf{s}_i, t \right)\) are observed only at \(t = 1\).


  • Calibration:
    • Estimate \(\boldsymbol{\beta}\) using:
      • \(\left( \mathbf{y} \left( \mathbf{s}_1, 1 \right), \ldots, \mathbf{y} \left( \mathbf{s}_n, 1 \right) \right)'\) and
      • \(\left( \mathbf{z} \left( \mathbf{s}_1, 1 \right), \ldots, \mathbf{z} \left( \mathbf{s}_n, 1 \right) \right)'\).


  • Reconstruction:
    • Use estimated \(\boldsymbol{\beta}\)s and fossil pollen \(\mathbf{y} \left( \mathbf{s}, t \right)\) to predict unobserved \(\mathbf{z}\left( \mathbf{s}, t \right)\).
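The reconstruction step can be illustrated for a single fossil sample by scoring candidate climate values on a grid. This sketch makes strong simplifying assumptions that are not in the talk: a scalar climate covariate with an intercept, a flat prior over the grid in place of the space-time process model, and made-up values for the estimated coefficients and the counts. All names are hypothetical.

```python
import math

def log_kernel(y, alpha):
    """Dirichlet-multinomial log-likelihood, dropping the term that is constant in alpha."""
    n, a = sum(y), sum(alpha)
    out = math.lgamma(a) - math.lgamma(n + a)
    out += sum(math.lgamma(yj + aj) - math.lgamma(aj) for yj, aj in zip(y, alpha))
    return out

def reconstruct_on_grid(y, beta, grid):
    """Normalized weights over candidate climate values, assuming a flat prior on the grid."""
    logs = []
    for z in grid:
        x = [1.0, z]  # intercept and scalar climate covariate
        alpha = [math.exp(sum(xk * bk[j] for xk, bk in zip(x, beta)))
                 for j in range(len(beta[0]))]
        logs.append(log_kernel(y, alpha))
    m = max(logs)  # subtract the maximum before exponentiating for numerical stability
    w = [math.exp(l - m) for l in logs]
    total = sum(w)
    return [wi / total for wi in w]

beta_hat = [[1.0, 1.0],    # hypothetical calibrated coefficients (q = 2, d = 2)
            [1.0, -1.0]]   # taxon 1 increases with z, taxon 2 decreases
grid = [-2.0 + 0.1 * i for i in range(41)]
y_fossil = [18, 2]         # hypothetical fossil counts dominated by taxon 1
weights = reconstruct_on_grid(y_fossil, beta_hat, grid)
z_hat = grid[weights.index(max(weights))]
```

Because taxon 1 dominates the sample and responds positively to the covariate, the weights concentrate on positive values of \(z\).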


Calibration Model

Non-linear Calibration Model

  • Vegetation response to climate is non-linear.


\(\begin{align*} \operatorname{log} \left( \boldsymbol{\alpha} \left( \mathbf{s}_i, t \right) \right) & = \mathbf{B} \left( \mathbf{z}\left( \mathbf{s}_i, t \right) \right) \boldsymbol{\beta} \end{align*}\)


  • \(\mathbf{B} \left( \mathbf{z}\left( \mathbf{s}_i, t \right) \right)\) is a basis expansion of the covariates \(\mathbf{z}\left( \mathbf{s}_i, t \right)\).
    • Use B-splines or Gaussian processes as a basis.


  • Note that for \(t \neq 1\), the \(\mathbf{z} \left( \mathbf{s}_i, t \right)\)s are unobserved.
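One way to build the B-spline version of \(\mathbf{B}(\cdot)\) for a scalar covariate is the Cox-de Boor recursion. The sketch below is a plain-Python illustration. The knot vector, degree, and function name are assumptions chosen for the example, not choices stated in the talk.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x via the Cox-de Boor recursion."""
    def basis(i, k):
        if k == 0:
            # Half-open intervals: x equal to the last knot evaluates to zero everywhere.
            return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
        value = 0.0
        left_width = knots[i + k] - knots[i]
        if left_width > 0:  # skip terms with repeated knots (0/0 is treated as 0)
            value += (x - knots[i]) / left_width * basis(i, k - 1)
        right_width = knots[i + k + 1] - knots[i + 1]
        if right_width > 0:
            value += (knots[i + k + 1] - x) / right_width * basis(i + 1, k - 1)
        return value
    return [basis(i, degree) for i in range(len(knots) - degree - 1)]

# Hypothetical clamped cubic knot vector on [0, 4], giving 7 basis functions
knots = [0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4]
row = bspline_basis(1.5, knots, degree=3)
```

Each such row plays the role of \(\mathbf{B}(\mathbf{z}(\mathbf{s}_i, t))\), so \(\boldsymbol{\beta}\) has one row per basis function instead of one row per covariate.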

Process Model

\(\begin{align*} [\mathbf{Z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}] & \propto [\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{Z}] \color{blue}{[\mathbf{Z} | \boldsymbol{\theta}_P]}[\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P] \end{align*}\)

Dynamic Model

  • We are interested in estimating the latent process \(\mathbf{z} \left( \mathbf{s}, t \right)\).


  • The model can accommodate:
    1. continuous vs. discrete space (geostatistical vs. CAR models).
    2. continuous vs. discrete time (stochastic process vs. AR models).


  • For now, we focus on continuous space and discrete time.

Dynamic Model

  • For \(\mathbf{z} \left(t \right) = \left( \mathbf{z} \left(\mathbf{s}_1, t \right)', \ldots, \mathbf{z} \left(\mathbf{s}_n, t \right)' \right)\), we assume:

\(\begin{align*} \mathbf{z} \left(t \right) - \mathbf{X} \left( t \right) \boldsymbol{\gamma} & = \mathbf{M}\left(t\right) \left( \mathbf{z} \left(t-1 \right) - \mathbf{X} \left( t-1 \right) \boldsymbol{\gamma} \right) + \boldsymbol{\eta} \left(t \right) \end{align*}\)


  • \(\mathbf{M}(t) = \rho \mathbf{I}_n\) is a propagator matrix.
  • \(\mathbf{X} \left(t \right) \boldsymbol{\gamma}\) are the fixed effects from covariates like latitude, elevation, etc.
  • \(\boldsymbol{\eta} \left( t \right) \sim \operatorname{N} \left( \mathbf{0}, \tau^2 \mathbf{R}\left( \boldsymbol{\phi} \right) \right)\).
  • \(\tau^2\) is the spatial process variance.
  • \(\mathbf{R} \left( \boldsymbol{\phi} \right)\) is a Matérn spatial covariance matrix with parameters \(\boldsymbol{\phi}\).
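To make the dynamics concrete, the following sketch simulates a scalar climate field from this autoregressive model. It assumes a Matérn correlation with smoothness 3/2 (chosen because it has a closed form), a single range parameter, a time-invariant mean standing in for \(\mathbf{X}\boldsymbol{\gamma}\), and a process that starts at its mean. Site coordinates, parameter values, and function names are hypothetical.

```python
import math
import random

def matern32(dist, phi):
    """Matern correlation with smoothness 3/2 and range phi (an assumed special case)."""
    a = math.sqrt(3.0) * dist / phi
    return (1.0 + a) * math.exp(-a)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def simulate_dynamic(coords, mean, rho, tau2, phi, T, rng):
    """Simulate z(t) - mean = rho * (z(t-1) - mean) + eta(t), eta(t) ~ N(0, tau2 * R)."""
    n = len(coords)
    cov = [[tau2 * matern32(math.dist(a, b), phi) for b in coords] for a in coords]
    L = cholesky(cov)
    z = list(mean)  # assumption: start the process at its mean
    path = [z]
    for _ in range(1, T):
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        eta = [sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(n)]
        z = [mean[i] + rho * (z[i] - mean[i]) + eta[i] for i in range(n)]
        path.append(z)
    return path

coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]   # hypothetical site locations
mean = [0.0, 1.0, 2.0]                          # hypothetical fixed effects X * gamma
path = simulate_dynamic(coords, mean, rho=0.8, tau2=0.5, phi=1.5, T=10,
                        rng=random.Random(1))
```

With \(\mathbf{M}(t) = \rho \mathbf{I}_n\), each site follows its own AR(1) in time, and all spatial dependence enters through the correlated innovations \(\boldsymbol{\eta}(t)\).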

Elevation covariates

Time Uncertainty

  • Each fossil pollen observation includes estimates of time uncertainty.
    • The time of the observation is uncertain.
    • Weight the likelihoods according to the age-depth model.
    • Posterior distribution of ages.
  • For each fossil pollen observation, an age-depth model gives a posterior distribution over dates.
    • Define \(\omega \left(\mathbf{s}_i, t \right)\) as P(age \(\in (t-1, t)\)).
    • \([\mathbf{y} \left( \mathbf{s}_i, \cdot \right) | \boldsymbol{\alpha} \left( \mathbf{s}_i, \cdot \right) ] = \prod_{t=1}^T [\mathbf{y} \left( \mathbf{s}_i, t \right) | \boldsymbol{\alpha} \left( \mathbf{s}_i, t \right)]^{\omega \left(\mathbf{s}_i, t \right)}\).
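On the log scale the weighted product becomes a weighted sum of per-time log-likelihoods. The sketch below shows one possible way to obtain the weights from posterior age draws and apply them. It assumes ages are expressed in units of model time steps, and the function names and numbers are hypothetical.

```python
import math

def age_weights(age_draws, T):
    """Fraction of posterior age draws falling in each interval (t-1, t], t = 1..T."""
    counts = [0] * T
    for a in age_draws:
        t = math.ceil(a)
        if 1 <= t <= T:  # draws outside (0, T] are not assigned to any interval
            counts[t - 1] += 1
    return [c / len(age_draws) for c in counts]

def weighted_loglik(logliks, weights):
    """log of prod_t L_t ** w_t, i.e. sum_t w_t * log L_t."""
    return sum(w * l for w, l in zip(weights, logliks))

draws = [0.5, 1.5, 1.7, 2.2]      # hypothetical age-depth posterior draws
omega = age_weights(draws, T=3)
# Hypothetical log-likelihoods of one sample evaluated at t = 1, 2, 3
total = weighted_loglik([-1.0, -2.0, -3.0], omega)
```

Time steps with no posterior mass get weight zero and drop out of the likelihood for that sample.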

Simulation Study

Simulated data

Simulated Reconstruction

Simulated Reconstruction Temporal Trend

Reconstruction

Reconstruction over time

Reconstruction Inference

  • Current methods are site-level "transfer function" methods.
    • These methods ignore elevation, temporal autocorrelation, and spatial autocorrelation.
    • Sensitive to the data.
    • Poor quantification of uncertainty.
    • Unclear how to choose among models.
  • The spatial method is statistically principled.
    • Has higher power.
    • Smaller uncertainties that change with data (sample size, signal coherence, etc.).
    • Can use model selection methods (information criteria, etc.).

Conclusion

Model framework opens the door to answering meaningful questions:

  • Do pollen distributions change with elevation?
    • Covariate-sensitive parameterizations.
  • Do pollen distributions change over space or time?
    • Regression coefficients vary over space/time.
  • How to combine multiple proxies (tree rings, pollen, etc.)?
    • Each proxy gets its own data model.
    • Proxies link to dynamic space-time process.

Thanks for the attention